Language Modeling

This project is part of the coursework for Dr. Nikos Altras's Natural Language Processing class at the University of Sheffield.

1. Introduction

The goal of this project is to implement three language models to perform sentence completion. In other words, given a sentence with a missing word, the language model should be able to choose the correct word from a list of candidates. Language models are widely used in natural language processing for tasks that go beyond sentence completion. The most obvious applications are speech recognition and machine translation. For example, in speech recognition systems a language model is generally combined with an acoustic model of how different words are pronounced. For each utterance, the acoustic model returns a set of plausible sentences, and the correct or most likely sentence is then determined by the language model.

A language model is defined as follows. First, we need to define $V$ as the set of all words in a language. Second, we can define a sentence in the language as a sequence of words such as

$$ x_1, x_2, x_3, \dots, x_n$$

where each $x_i$ is a word, $n \geq 1$, and we assume $x_1$ and $x_n$ to be special symbols such as $<s>$ and $</s>$.

Language models aim to define a function $P(x_1, x_2, x_3, \dots, x_n)$, or $P(x)$, which gives the probability of a sentence in the language. In this project I will define the function $P(x_1, x_2, x_3, \dots, x_n)$ using three language models:

  1. A unigram language model (ULM): The unigram language model is the simplest of the three. In this model, each word in the sentence is treated as a stand-alone entity and its probability is assumed to be independent of the probabilities of all other words in the same sentence. The probability of each word is simply its frequency divided by the total number of word tokens in the training text. With these assumptions, $P(x_1, x_2, x_3, \dots, x_n)$ can be defined as: $$P(x_1, x_2, x_3, \dots, x_n) = \prod_{i=1}^{n}{P(x_i)}$$ In practice, given the sentence "I like ice cream", its probability can be calculated as $$P(\text{I like ice cream}) = P(\text{I})P(\text{like})P(\text{ice})P(\text{cream})$$
  1. A bigram language model (BLM): One of the limitations of the ULM is that it does not take context into account, since each word is processed individually and its probability is considered to be independent of the other words in the sentence. This limitation can be overcome with a bigram, or Markov, language model. The main assumption of the Markov model is that the identity of a word is conditioned on the identity of the word that precedes it. In this case, $P(x_1, x_2, x_3, \dots, x_n)$ can be defined as: $$P(x_1, x_2, x_3, \dots, x_n) = \prod_{i=2}^{n}{P(x_i|x_{i-1})}$$ where $P(x_i|x_{i-1}) = \frac{P(x_{i-1}, x_i)}{P(x_{i-1})}$

    Again, in practice the probability of the sentence "I like ice cream" would be: $$P(\text{I like ice cream}) = P(\text{I}|<s>)P(\text{like}|\text{I})P(\text{ice}|\text{like})P(\text{cream}|\text{ice})P(</s>|\text{cream})$$

  1. A bigram language model with Laplace smoothing: What happens when we want to assess the probability of a word that is not in our training data? According to the literature on ULMs and BLMs, such words can either be discarded or be given a probability of zero [1] [3]. The problem is that assigning a zero probability makes the probability of any sentence containing such words zero as well, regardless of its other features.

    The simplest solution to this problem is add-one Laplace smoothing, according to which $P(x_i|x_{i-1})$ is defined as $$P(x_i|x_{i-1}) = \frac{P(x_{i-1}, x_i)+1}{P(x_{i-1}) + |V|}$$ where $|V|$ is the size of the vocabulary. A short sketch of all three estimates on a toy corpus follows this list.
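
To make the three estimates concrete, here is a minimal sketch that computes them on a tiny toy corpus using Python's collections.Counter. The corpus, the query words and the resulting numbers are made up purely for illustration, and the sketch works with raw counts (the textbook formulation) rather than the relative frequencies used in the implementation further below.

from collections import Counter

# toy corpus of two tokenised sentences (hypothetical, for illustration only)
corpus = [['<s>', 'i', 'like', 'ice', 'cream', '</s>'],
          ['<s>', 'i', 'like', 'tea', '</s>']]

tokens = [w for sentence in corpus for w in sentence]
bigrams = [(s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1)]

uni_counts = Counter(tokens)
bi_counts = Counter(bigrams)
N = len(tokens)      # total number of word tokens
V = len(uni_counts)  # vocabulary size |V|

# unigram estimate: P(like) = count(like) / N
p_like = uni_counts['like'] / N

# bigram estimate: P(ice | like) = count(like ice) / count(like)
p_ice_given_like = bi_counts[('like', 'ice')] / uni_counts['like']

# add-one (Laplace) estimate: (count(like ice) + 1) / (count(like) + |V|)
p_laplace = (bi_counts[('like', 'ice')] + 1) / (uni_counts['like'] + V)

print(p_like, p_ice_given_like, p_laplace)  # 0.18..., 0.5, 0.22...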

2. Implementation

The sentences to be completed with the three language models, along with the two candidate words for each, can be found here. The .txt file contains 10 sentences such as the one below.


I don't know __ to go out or not (weather/whether)


The text used to train the language models is a collection of 500,000 sentences from the 1 Billion Word Benchmark. The following subsections explain how the text was processed before analysis, describe the three language models, and discuss the results obtained.

2.1 Pre-processing

The first step in processing the text was to read the testing text, substitute the "___" in each sentence with each of the two options, and return a new testing set containing 20 sentences. For example, the sentence "I don't know ___ to go out or not (weather/whether)" became "I don't know weather to go out or not" and "I don't know whether to go out or not".
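
As a simplified, string-level illustration of this step (the actual create_combinations function below operates on tokenised sentences), the substitution could look as follows; the variable names are mine and purely illustrative.

line = "I don't know ___ to go out or not (weather/whether)"

# split off the options, e.g. "(weather/whether)" -> ['weather', 'whether']
sentence, _, options = line.rpartition(' ')
options = options.strip('()').split('/')

# create one variation of the sentence per option
variations = [sentence.replace('___', option) for option in options]
print(variations)
# ["I don't know weather to go out or not",
#  "I don't know whether to go out or not"]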

The second step was to read the training text and, together with the new testing set, convert it to lowercase, strip it of punctuation and attach the start and end tags $<s>$ and $</s>$.

For both the ULM and the BLM, I decided to assign a probability of zero to tokens that appeared in the testing text but were not present in the training text. Lastly, the accuracy of all models was calculated by awarding a score of 1 if the model predicted the correct sentence variation, 0.5 if the model returned both variations of a sentence with the same likelihood, and 0 if the model predicted the wrong sentence or assigned a zero probability to both variations. The total score of the model was then divided by the total number of sentences to complete. Accuracy scores range from 0 to 1, with 1 being a perfect score and 0 being the worst possible score.
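
As a purely hypothetical example of this scoring scheme: a model that predicted 8 of the 10 sentences correctly, tied on one pair and failed on the last one would obtain $$\frac{8 \times 1 + 1 \times 0.5 + 1 \times 0}{10} = 0.85$$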

The code for all the functions used to implement the three language models follows.

In [9]:
# import required packages
import re
from collections import Counter
import numpy as np
import itertools
from datetime import datetime
In [10]:
# start time
start_c=datetime.now()

# declare global variables
train_path = './news-corpus-500k.txt' 
test_path = './questions.txt'  

start = '<s> ' 
end = ' </s>'
# list of correct sentences against which I will calculate accuracy of my models
correct_sentences = ['i don t know whether to go out or not', 'we went through the door to get inside',\
                     'they all had a piece of the cake', 'she had to go to court to prove she was innocent',\
                     'we were only allowed to visit at certain times', 'she went back to check she had locked the door',\
                     'can you hear me', 'do you usually eat cereal for breakfast', 'she normally chews with her mouth closed',\
                     'i m going to sell it on the internet']
In [11]:
##### required functions ######

# function that generates n-grams.
def generate_ngrams(s, n=1):
    # the character class in this regex matches nothing (it includes \D), so the real
    # cleanup is done by the explicit character list below
    s = re.sub(r'[^a-zA-Z0-9\s\D]', ' ', s)
    # replace punctuation with spaces, keeping '<', '>', '/' and '_' so the <s>/</s>
    # tags, the word1/word2 options and the '____' placeholder survive tokenisation
    chars = "`*{}'-\?^!%£[](),;:|#+.!$"
    for c in chars:
        s = s.replace(c, " ")
    # Break the sentence into tokens and remove empty tokens
    tokens = [token for token in s.split(" ") if token != ""]
    
    # Use the zip function to help us generate n-grams
    # Concatenate the tokens into ngrams and return
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

# function that pre-processes and tokenizes the data
def tokienize(data, start , end, n):
    data_dict = {}
    for i in range(len(data)): 
        line = data[i].lower()
        line = line.replace('\n', " ")
        line = start + line + end
        line = generate_ngrams(line, n)
        data_dict.update({"S_"+str(i) : line})
    return data_dict

# function that reads the data line by line and tokenizes it using the function above.
def read_data(filepath, start, end, n):
    with open(filepath) as f:
        content = f.readlines()
        content = tokienize(content, start, end, n)
    return content

# function that creates combinations of sentences for the test set.
def create_combinations(test_path, start, end):
    test_data = read_data(test_path, start, end, n=1) #read test data

    new_sen = []
    # for each sentence
    for ID, sentence in test_data.items():
        index_ = -2

        # single out the words to replace
        options = [sentence[index_].split('/')[0],sentence[index_].split('/')[1]]
        sentence = sentence[:index_] + sentence[index_+1 :]

        # for each word in the sentence.
        for index, word in enumerate(sentence):
            if word == '____': # if the word is the '____' placeholder, replace it with each option.
                for option in options:
                    sentence[index] = option
                    new_sen.append(' '.join(sentence))
    #return new sentences
    return new_sen


# function that calculates the probability of unigrams or bigrams in the training text.
def get_word_prob(train_data):
    # get a list of all words and concatenate this list
    t = list(train_data.values())
    t = list(itertools.chain.from_iterable(t))
    # use counter to count the frequency of all words
    counter = Counter(t)
    prob = {}
    # declare the denominator as the sum of all frequencies of all words.
    denom = sum(counter.values())

    # for each word in the corpus
    for word, count in counter.items():
        # calculate probability as (word count) / denominator
        p = {word : (count) / denom }
        prob.update(p)
    # return dictionary of probabilities.
    return prob


#function that calculates the probability of a sentence according to the unigram model.
def sen_prob_uni(train_data_uni, test_data_uni):
    W_probs = get_word_prob(train_data_uni) #calculate the probability of each word in the corpus.
    S_prob = {}
    for ID, sentence in test_data_uni.items(): # for each sentence,
        p_s = 0
        for word in sentence: #for each word in the sentence,
            if word not in W_probs: #if the word is not in the training set, add it, with a zero probability.
                W_probs.update({word: 0.0})
            p_s += np.log(W_probs[word]) # add the word's log probability to the sentence's log probability
            sen = ' '.join(sentence)
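            # collapse any immediately repeated span of words in the joined sentence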
            sen = re.sub(r'\b(.+)\s+\1\b', r'\1', sen)
            S_prob.update({sen : p_s})
    # return a dictionary with sentence and probability
    return S_prob

# function that calculates the probability of each sentence according to the bi-gram model.
def sen_prob_bi(train_data_uni, train_data_bi, test_data_bi, smoothing):
    W_probs = get_word_prob(train_data_uni) #calculate probability of each word in the corpus
    bi_prob = get_word_prob(train_data_bi) #calculate probability of each bi-gram in the corpus
    S_prob = {}
    for ID, sentence in test_data_bi.items(): # for each bi-gram sentence,
        p_s = 1
        for word in sentence: #for each bi-gram in sentence
            if word not in bi_prob: # if the bi-gram is not in the training corpus, add it with zero probability.
                bi_prob.update({word: 0.0})
            denom = word.split(' ')[0] # declare the denominator, which is the first word of the bi-gram.
            if denom not in W_probs: # if the word is not in the total corpus, add it with zero probability.
                W_probs.update({denom :0.0})
            if smoothing == 'no': # if we don't want to smooth the data,
                p_s *= bi_prob[word]/ W_probs[denom] # multiply in prob(bi-gram) / prob(first word of bi-gram)
            elif smoothing == 'yes': # if we want to smooth the data
                p_s *= (bi_prob[word] + 1) / (W_probs[denom] + len(list(W_probs.keys()))) # add 1 to the numerator and the vocabulary size |V| to the denominator
            sen =' '.join(sentence)
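            # joining overlapping bigrams duplicates every inner word; collapse the duplicates to recover the plain sentence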
            sen = re.sub(r'\b(.+)\s+\1\b', r'\1', sen)
            S_prob.update({sen : p_s})
    # return dictionary with sentences and their probabilities.
    return S_prob

# function that returns the sentence with the higher probability between the two options
def return_top_sentences(sentence_dict):
    # get a list of all test sentences
    d = list(sentence_dict.items())
    n = 2
    # split the list into consecutive pairs so that each pair is contained in a sublist.
    d = [d[i:i + n] for i in range(0, len(d), n)]
    mx = []
    # for each sublist
    for i in d:
        # create another list that contains the two probability values.
        p = [i[0][1],i[1][1]]
        # if the probabilities are equal and equal to zero
        if p[0] == p[1] == 0:
            # return ZERO
            mx.append('ZERO')
        # if they are equal but non-zero
        elif p[0] == p[1] != 0:
            #return EQUAL
            mx.append('EQUAL')
        else:
            #if they are different, return the sentence that has the higher probability.
            index = p.index(max(p))
            sen = i[index][0]
            sen = sen[:-5] # delete start and end tags from sentence.
            sen = sen[4:]
            mx.append(sen)
    return mx

# unigram language model function
def ULM(train_data, test_data):
    sentence_dict = sen_prob_uni(train_data, test_data) #get probability of each sentence
    sentences = return_top_sentences(sentence_dict) #return sentences with higher probability
    return sentences

# bigram language model
def BLM(train_data_uni, train_data_bi, test_data_bi, smoothing):
    sentence_dict = sen_prob_bi(train_data_uni, train_data_bi, test_data_bi, smoothing)
    sentences = return_top_sentences(sentence_dict)
    return sentences

# function that calculates accuracy.
def accuracy(pred, correct_sentences):
    correct = 0
    for sen in pred: # for each sentence returned by the language model,
        if sen in correct_sentences: # if the sentence is in the list of correct sentences,
            correct += 1 #add 1
        elif sen == 'EQUAL': # if sentences have tie values, add 0.5
            correct += 0.5
            # else, do nothing
    # calculate accuracy by dividing the score reached by the number of total sentences
    accuracy = correct / len(correct_sentences)
    # return accuracy
    return accuracy


# this function initialises the project by
def initialise(train_path, test_path, start, end, n):

    train_data = read_data(train_path, start, end, n) # reading and tokenizing the training data

    test_d = create_combinations(test_path, start, end) #reading and creating combinations of the testing data
    test_data = {}
    for i in range(len(test_d)): # tokenizing the testing data.
        sen = generate_ngrams(test_d[i], n)
        test_data.update({'S_'+ str(i):sen})
    return train_data, test_data

2.1.1 Project Execution

Unigram Language Model

In a ULM each word is processed by itself and its likelihood is assumed to be independent from the likelihood of other words in the sentence.

My ULM algorithm started by tokenizing each sentence in the training and testing text into unigrams, or single words. The probability of each unigram in the training text was then calculated as $$P(word) = \frac{count(word)}{N}$$ where $N$ is the total number of word tokens in the training text. These probabilities were then used to calculate the likelihood of each sentence in the testing text by applying the formula $P(x) = \prod_{i=1}^{n} P(x_i)$ (implemented as a sum of log probabilities).

After calculating the likelihood of each sentence, the sentences were split into pairs and the sentence with the higher likelihood within each pair was returned as the predicted correct sentence. The ULM obtained an accuracy score of 0.6, meaning that the model predicted the correct sentence 60% of the time. This score can be improved by introducing context into the model with a BLM.

In [12]:
## read training and testing data as unigrams and bi-grams
train_data_uni, test_data_uni = initialise(train_path, test_path, start = start, end = end,  n = 1)
train_data_bi , test_data_bi = initialise(train_path, test_path, start = start , end = end, n = 2)
In [13]:
#### UNIGRAM LANGUAGE MODEL ####
print('UNIGRAM LANGUAGE MODEL', '\n')
# return correct sentences
pred_uni = ULM(train_data_uni, test_data_uni)
# calculate accuracy
accuracy_uni = accuracy(pred_uni, correct_sentences)
print('The correct sentences returned by the unigram language model are: ', pred_uni, '\n', 'Accuracy value: ', accuracy_uni, '\n')
UNIGRAM LANGUAGE MODEL 

The correct sentences returned by the unigram language model are:  ['i don t know whether to go out or not', 'we went through the door to get inside', 'they all had a peace of the cake', 'she had to go to court to prove she was innocent', 'we were only allowed to visit at certain times', 'she went back to check she had locked the door', 'can you here me', 'do you usually eat serial for breakfast', 'she normally choose with her mouth closed', 'i m going to sell it on the internet'] 
 Accuracy value:  0.6 

Bigram Language Model

In a bigram, or Markov, language model the likelihood of a word in a sentence is assumed to be conditioned on the word that precedes it. Therefore, in a BLM the testing and training text is tokenized into bigrams, or pairs of words, and the likelihood of each unigram as well as each bigram is calculated.

The probability of each word given the previous one is obtained by $$P(x_i|x_{i-1}) = \frac{P(x_{i-1}, x_i)}{P(x_{i-1})}$$

which is simply the likelihood of each bigram divided by the likelihood of the bigram's first word. Once again, the likelihood of each test sentence is calculated and the predicted correct sentences are returned. As expected, the accuracy score of the BLM improved to 0.8, meaning that the model was correct 80% of the time.
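
As a purely illustrative example with made-up numbers: if the bigram "ice cream" had a relative frequency of 0.0001 in the training text and the word "ice" a relative frequency of 0.0004, then $$P(\text{cream}\,|\,\text{ice}) = \frac{0.0001}{0.0004} = 0.25$$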

Because I assigned zero probability to unseen bigrams and unigrams, the BLM assigned zero probability to both variations of two of the test sentences. This issue is addressed by the BLM with Laplace smoothing discussed below.

In [14]:
#### BI-GRAM LANGUAGE MODEL ####
print('BIGRAM LANGUAGE MODEL', '\n')
pred_bi = BLM(train_data_uni, train_data_bi, test_data_bi, smoothing = 'no')
accuracy_bi = accuracy(pred_bi, correct_sentences)
print('The correct sentences returned by the bigram language model are: ', pred_bi, '\n', 'Accuracy value: ', accuracy_bi, '\n')
BIGRAM LANGUAGE MODEL 

The correct sentences returned by the bigram language model are:  ['i don t know whether to go out or not', 'we went through the door to get inside', 'they all had a piece of the cake', 'she had to go to court to prove she was innocent', 'we were only allowed to visit at certain times', 'she went back to check she had locked the door', 'can you hear me', 'ZERO', 'ZERO', 'i m going to sell it on the internet'] 
 Accuracy value:  0.8 

Bigram Language Model with Laplace Smoothing

The BLM with Laplace smoothing follows the same procedure as the BLM described above; however, this time $P(x_i|x_{i-1})$ is given by $\frac{P(x_{i-1},x_i)+1}{P(x_{i-1})+|V|}$. This model avoids zero probabilities by assigning a small probability of approximately $\frac{1}{|V|}$ to unseen bigrams.

This ensures that the likelihood of every sentence can still be calculated and is non-zero. As expected, Laplace smoothing allowed the language model to achieve an accuracy score of 0.9, making it the best-performing model of the three.
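
For intuition, with a hypothetical vocabulary of $|V| = 100{,}000$ word types, an unseen bigram whose first word has a relative frequency of 0.0004 would receive $$P(x_i|x_{i-1}) = \frac{0 + 1}{0.0004 + 100000} \approx 10^{-5},$$ which is small but non-zero.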

In [15]:
#### BI-GRAM LANGUAGE MODEL WITH LAPLACE SMOOTHING ####
print('BIGRAM LANGUAGE MODEL WITH LAPLACE SMOOTHING', '\n')
pred_laplace = BLM(train_data_uni, train_data_bi, test_data_bi, smoothing = 'yes')
accuracy_laplace = accuracy(pred_laplace, correct_sentences)
print('The correct sentences returned by bigram language model with Laplace smoothing are: ', pred_laplace, '\n', 'Accuracy value: ', accuracy_laplace , '\n')

print('Time: ', datetime.now()- start_c)
BIGRAM LANGUAGE MODEL WITH LAPLACE SMOOTHING 

The correct sentences returned by bigram language model with Laplace smoothing are:  ['i don t know whether to go out or not', 'we went through the door to get inside', 'they all had a piece of the cake', 'she had to go to court to prove she was innocent', 'we were only allowed to visit at certain times', 'she went back to check she had locked the door', 'can you hear me', 'do you usually eat cereal for breakfast', 'she normally choose with her mouth closed', 'i m going to sell it on the internet'] 
 Accuracy value:  0.9 

Time:  0:01:14.746530